The following process details how to conduct a power analysis for both the betaSharedTest and twoThetaTest on simulated gene expression data. The process utilizes the EVE model, described in Rohlfs, Nielsen 2015, implemented in the evemodel package. The first step is to simulate data under the null and alternative distributions. This requires a phylogeny over which to simulate, and parameter value definitions for \(\theta, \sigma^2, \alpha\), and \(\beta\). The output will include simulated data, a matrix of power values associated with each alternative distribution, and a log-scaled power curve with a point for the false positive rate.
Please note, there are cells of code where the RMarkdown command eval = FALSE is specified so as to only display code without output. Setting the below boolean variable to TRUE will allow the cells to evaluate the code, which will take a number of hours. It is recommended to first look through the included examples of output before running the analysis on your local machine.
evalBool <- FALSE
First, we load in the tree. Here we use the ape package to read in a tree in parenthetic format, which returns a “phylo” object. The cell then plots the phylogeny.
phy.string <- "(gar:300,(zebrafish:220,((medaka:120,stickleback:120):60,((esox:100,(central_mudminnow:80,olympic_mudminnow:80):20):25,((grayling1:50,(hucho1:35,(salmon1:20,(charr1:10,(rainbow_trout1:5,coho1:5):5):10):15):15):50,(grayling2:50,(hucho2:35,(salmon2:20,(charr2:10,(rainbow_trout2:5,coho2:5):5):10):15):15):50):25):55):40):80);"
tre <- read.tree(text = phy.string)
plot(tre) # plot tree
In the cell below, the values for the parameters are specified. Most importantly, \(\theta, \sigma^2, \alpha\), and \(\beta_{shared}\). \(\theta\) indicates the optimal value for an Ornstein–Uhlenbeck (OU) process; \(\sigma^2\) determines the magnitude of drift, or OU fluctuations, away from \(\theta\); \(\alpha\) corresponds to stablizing selection, or the pull towards theta; \(\beta_{shared}\) represents the constant value of \(\beta\) under the null hypothesis, where \(\beta\) parameterizes the ratio of population to evolutionary expression variance.
# parameter definitions
# basic parameters needed for simulation
numIndivs <- 4 # number of desired individuals per species - *needs to be defined by user*
numGenes <- 100 # number of genes per dataset - *needs to be defined by user*
numSpecies <- length(tre$tip.label) # number of species
speciesLabels <- rep(tre$tip.label,each=numIndivs) # character vector specifying species names; length should be number of total individuals (species x individuals)
# model parameters - *all need to be defined by user*
theta <- 50 # optimal value for an OU process
sigma2 <- 0.1 # drift, OU fluctuations
alpha <- 0.005 # stabilizing selection; pull towards theta - higher values reduce covariance
betaShared <- 0.1 # constant value of beta under the null hypothesis
The following chunk of code calculates the covariance matrix, which is used to determine parameter values to simulate under. This should be reasonably aligned with the phylogeny - expression covariances vary between pairs of species that are closely and more distantly related. For example, if a disproportionately high \(\alpha\) value is used, the covariance is significantly reduced. It’s important to look at the covar matrix to ensure the magnitudes of covariance are not too small or large.
evolVar <- evolVarOU(tre, rep(alpha, numSpecies*2-2), rep(sigma2, numSpecies*2-2), sigma2/(2*alpha)) # calculates expected evolutionary variance (simga2/2alpha) for all branches
covar <- covarianceOU(tre, rep(alpha, numSpecies*2-2), evolVar) # uses evolVar to calculate covariance matrix
covar_rank_ref<-matrix(nrow=length(unique(c(covar))), ncol=1) # set up a rank matrix that lists the rankings of the covariances, with 1 being the strongest covariance ranking
covar_rank_ref[,1]<-sort(unique(c(covar)), decreasing=T)
covar_rank<-matrix(nrow=numSpecies,ncol=numSpecies)
for(i in 1:numSpecies){
for(j in 1:numSpecies){
rank <- which(covar_rank_ref==covar[i,j])
covar_rank[i,j] <- rank
}
}
rownames(covar_rank) <- rownames(covar)
colnames(covar_rank) <- colnames(covar)
Here we visualize the covariance matrix in a heatmap to confirm appropriate clustering. In this example, a few highlights we notice are 1) high covariance along the diagonal, representing a one-to-one relationship; 2) little covariance between “gar” and the rest of the species; and, 3) higher covariance clustering between the two duplicate clades.
Below is an example of an uninformative covariance matrix, where \(\alpha\) was chosen disproportionately large and drove the covariance to 0 or nearly 0. Since the stabilizing selection value is so high, the covariance matrix does not tell us anything about the relationships between species.
For the theta shift test, we use the same phylogenetic tree, and the same null distribution parameters. In this example, we are interested in analyzing EVE’s ability to detect a shift in \(\theta\) depending on the location of the shift along the phylogeny. We focus on two shift points (refer to image below) on the salmonid phylogeny to simulate varying magnitudes of \(\theta\) shifts. The shift points are defined as the MRCA of the group of interest, indicated by specifying the index of the first species of the group. For example, for our first shift point below, we specify an index of 14, which corresponds to “grayling2”.
shiftPoints <- c(14, 17) # corresponds to the stars on the diagram below, starting from the root down
Below is the definition of a function that simulates a single dataset for each \(\theta_2\) at each shift point. Unlike the power analysis for the betaShared test, where several datasets are needed, the thetaShift power analysis only necessitates a single dataset, because each gene is under the alternative distribution (a two-theta EVE model). The function then performs a twoThetaTest on the simulated dataset. Both the simulated dataset and the resulting log-scaled LRTs are saved to a file. The function takes in a vector of shift points, a vector of \(\theta_2\), and the number of desired genes.
simulate_computeLRT.thetaShift <- function(shiftPoints, thetaVals, numGenes){
for(sP in shiftPoints) {
thetaShiftBool <- 1:Nedge(tre) %in% getEdgesFromMRCA(tre, tips = tre$tip.label[sP:length(tre$tip.label)], includeEdgeToMRCA = T)
for(theta2 in thetaVals){
#simulate data
simThetaShift <- simTwoTheta(numGenes, tre, colSpecies = colSpecies, isTheta2edge = thetaShiftBool,
theta1=theta, theta2=theta2, sigma2 = sigma2, alpha = alpha, beta = beta)
simFile <- paste("sim_sp", sP, "_theta1", theta, "_theta2", theta2, "_sigma2", sigma2, "_alpha", alpha, "_beta", betaShared,".csv", sep="")
write.csv(simThetaShift, simFile)
#run EVE
test <- twoThetaTest(tree = tre, gene.data = simThetaShift, isTheta2edge = thetaShiftBool, colSpecies = colSpecies, cores=4)
LRT <- as.data.frame(test$LRT)
lrtFile <- paste("res_sp", sP, "_theta1", theta, "_theta2", theta2, "_sigma2", sigma2, "_alpha", alpha, "_beta", betaShared,".csv", sep="")
write.csv(LRT, lrtFile)
}
}
}
Using the defined function simulate_computeLRT.thetaShift, a single dataset of 100 genes is simulated for each \(\theta_2\) in the set {5, 20, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 90, 100}, which includes \(\theta_1\) as a point for false positive rate. Each simulated dataset is subsequently tested for a \(\theta\) shift.
thetas <- c(5, 20, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 90, 100) # values for theta2 under the two-theta model, for the first shift point
simulate_computeLRT.thetaShift(shiftPoints, thetas, 100) # run the function on the above vector of theta2, simulating datasets of 100 genes
Example of expression data with a \(\theta\) shift at shift point 14 (“grayling2”), and \(\theta_1=50\) and \(\theta_2=5\). As expected, gene expression is centered around \(\theta_1 = 50\) for the clade preceding the shift point, with a noticeable shift towards \(\theta_2 = 5\) for the species after the shift point.
Next are two examples of the distribution of the LRTs from a thetaShift test. The first is a histogram of the LRTs for a dataset simulated under a single \(\theta = 50\) and all genes are under \(\beta_{shared}\). The second is a histogram of the LRTs for the dataset simulated above. Each LRT corresponds to a single gene, and is visualized on a log-scale. Both sets of LRTs resulted from a thetaShift test, testing for a theta shift at shift point 14.
Below is the definition of the power analysis function that reads in the LRT file and returns the proportion of significant LRTs across all genes for each \(\theta_2\) at a single shift point. The significance threshold was calculated from null distribution simulations, and corresponds to a significance level of 5%. We recommend to conduct null distribution simulations to calculate significance thresholds based on significance levels appropriate for your needs. The function takes in a shift point, a vector of \(\theta_2\) values, and a critical value. The output of the function is a matrix for a particular shift point, where each row corresponds to one of the \(\theta_2\) values, and reports its power.
power.thetaShift <- function(shiftPoint, thetas, critVal){
meanExpDiff <- function(data){
colIndex <- shiftPoint*numIndivs - (numIndivs-1)
group1.meanExp <- mean(data[ , 1:(colIndex-1)])
group2.meanExp <- mean(data[ , colIndex:(numSpecies*numIndivs)])
return(group2.meanExp - group1.meanExp)
}
powerMat <- data.frame(thetaShift = NA, power = NA, meanExpDiff = NA)
for(theta2 in thetas){
fileName.LRT = paste("res_sp", shiftPoint, "_theta1", theta, "_theta2", theta2, "_sigma2", sigma2, "_alpha", alpha, "_beta", betaShared,".csv", sep="")
fileName.sim = paste("sim_sp", shiftPoint, "_theta1", theta, "_theta2", theta2, "_sigma2", sigma2, "_alpha", alpha, "_beta", betaShared,".csv", sep="")
LRTs <- read.csv(fileName.LRT, row.names=1)
sim <- as.matrix(read.csv(fileName.sim, row.names=1))
power <- sum(LRTs$test.LRT >= critVal)/numGenes
if((which(thetas == theta2)) == 1){ #if the first theta, fill in the template matrix
powerMat$thetaShift <- theta2
powerMat$power <- power
powerMat$meanExpDiff <- meanExpDiff(sim)
}
else { #otherwise bind by rows
powerMat <- rbind(powerMat, c(theta2, power, meanExpDiff(sim)))
}
}
return(powerMat)
}
In the next cell, we define the plotting function. The function takes in the same arguments as power.thetaShift, runs the function on the selected shift point, and outputs two equivalent power curves. The distinction between the two power curves is the x-axis. The first plots the proportion of cases rejecting the null hypothesis against the value of \(\theta_2\) under the two-theta EVE model. The second plots the same proportion of cases rejecting the null hypothesis against the mean expression difference between the \(\theta_1\) and \(\theta_2\) groups. This is to compare the theoretical magnitude of the shifted gene expression, due to the theta shift, to the actualized shift in gene expression.
plotPower.thetaShift <- function(shiftPoint, thetas, critVal){
power <- power.thetaShift(shiftPoint, thetas, critVal)
par(mfrow=c(2,1))
plot(power$thetaShift, power$power, ylab = "proportion of cases rejecting the null",
xlab = expression(paste(theta[2])),
main = expression(paste("Power for ", theta, " shift test")),
xlim = c(min(thetas), max(thetas)),
ylim = c(0, 1))
abline(h = 0.05, col="#cb4154", lty = 3)
grid()
plot(power$meanExpDiff, power$power, ylab = "proportion of cases rejecting the null",
xlab = expression(paste("Mean expression difference between ", theta[1], " and ", theta[2], " groups")),
main = expression(paste("Power for ", theta, " shift test")),
xlim = c(floor(min(power$meanExpDiff)), ceiling(max(power$meanExpDiff))),
ylim = c(0, 1))
abline(h = 0.05, col="#cb4154", lty = 3)
grid()
}
In order to run the above function, the shift point needs to be specified. The user changes the value of the first variable below, which will get passed to plotPower.thetaShift. We use 6.28 as the critical value, which was calculated from our null distribution simulations.
shift <- 14 # user-specified shift point
plotPower.thetaShift(shift, thetas, 6.28) # run the power curve function on the specified shift point for a significance level of 0.05
Below are the power curve plots for the twoThetaTest at shift point 14 and 17, respectively, with \(\theta_1=50\). Due to the shorter timescale at the more recent shift point, gene expression will not reach the same level of saturation post-shift if using the same set of \(\theta_2\) values that were used to simulated datasets at shift point 14. Therefore, for the purpose of this tutorial, a separate set of \(\theta_2\) values was chosen, such that the actualized strength of shift is similar to that which was produced for the first shift point. The set of \(\theta_2\) values used for shift point 17 was {-150, -100, -50, -25, 0, 10, 20, 30, 50, 70, 80, 90, 100, 125, 150, 200, 250}. Comparing the two power plots, we notice a slight increase in power for the more recent shift. This is reasonable, given that the magnitude in actualized shift is roughly equal, therefore a shift in \(\theta\) is more detectable on a shorter evolutionary time-scale.